Visualization of next generation sequencing (NGS) data at various genomic features on a genome-wide scale provides an effective way of exploring and communicating experimental results on one hand, yet poses as a tremendous challenge on the other hand, due to the huge amount of data to be processed. Existing software tools like deeptools, ngs.plot, coverageView and metagene2, while having attractive features and perform reasonably well in relatively simple scenarios, like plotting coverage profiles of fixed genomic loci or regions, have serious limitations in terms of efficiency and flexibility. For instance, deeptools requires 3 steps (3 sub-programs to be run) to generate plots from input files: first, convert .bam files to .bigwig format; second, compute coverage matrix; and last, plot genomic profiles. Huge amount of intermediate data are generated along the way and additional efforts have to be made to integrate these 3 closely related steps. All of them focus on plotting signals within genomic regions or around genomic loci, but not within or around combinations of genomic features. None of them have the capability of performing statistical tests on the data displayed in the profile plots.
To meet the diverse needs of experimental biologists, we have developed GenomicPlot using rich resources available on the R platform (particularly, the Bioconductor). Our GenomicPlot has the following major features:
The lengths of each part of the genes are prorated based on the median length of 5’UTR, CDS and 3’UTR of protein-coding genes in the genome. The total length (including upstream and downstream extensions) are divided into the specified number of bins. Subsets of genes can be plotted as overlays for comparison.
library(GenomicPlot, quietly=TRUE)
txdb <- AnnotationDbi::loadDb(system.file("extdata", "txdb_chr19.sql", package="GenomicPlot"))
gf <- prepare_5parts_genomic_features(txdb, meta=TRUE, nbins=100, fiveP=-2000, threeP=1000,
longest=TRUE)
queryfiles <- system.file("extdata", "treat_chr19.bam", package="GenomicPlot")
names(queryfiles) <- "clip_bam"
inputfiles <- system.file("extdata", "input_chr19.bam", package="GenomicPlot")
names(inputfiles) <- "clip_input"
importParams <- list(offset=-1, fix_width=0, fix_point="start", norm=TRUE,
useScore=FALSE, outRle=TRUE, useSizeFactor=FALSE, genome="hg19")
plot_5parts_metagene(queryFiles=queryfiles,
gFeatures_list=list("metagene"=gf),
inputFiles=inputfiles,
scale=FALSE,
verbose=FALSE,
transform=NA,
smooth=TRUE,
stranded=TRUE,
outPrefix=NULL,
importParams=importParams,
heatmap=TRUE,
rmOutlier=0,
nc=5)Individual sample profile and heatmap
Profile overlay
Ratio-over-input profile and heatmap
The lengths of each part of the profile plot are prorated to the lengths of upstream and downstream extensions as well as the median length of protein-coding genes in the genome. The total length (including upstream and downstream extensions) are divided into the specified number of bins.
txdb <- AnnotationDbi::loadDb(system.file("extdata", "txdb_chr19.sql", package="GenomicPlot"))
gf <- prepare_3parts_genomic_features(txdb, meta=FALSE, nbins=100, fiveP=-3000, threeP=2000,
longest=TRUE)
queryfiles <- system.file("extdata", "chip_treat_chr19.bam", package="GenomicPlot")
names(queryfiles) <- "chip_bam"
inputfiles <- system.file("extdata", "chip_input_chr19.bam", package="GenomicPlot")
names(inputfiles) <- "chip_input"
importParams <- list(offset=0, fix_width=150, fix_point="start", norm=TRUE,
useScore=FALSE, outRle=TRUE, useSizeFactor=TRUE, genome="hg19")
plot_3parts_metagene(queryFiles=queryfiles,
gFeatures=gf,
inputFiles=inputfiles,
scale=FALSE,
verbose=FALSE,
transform=NA,
smooth=TRUE,
stranded=TRUE,
outPrefix=NULL,
importParams=importParams,
heatmap=TRUE,
rmOutlier=0,
nc=5)Individual sample profile and heatmap
Profile overlay
Ratio-over-input profile and heatmap
Signal profile along with heatmaps in genomic features or user defined genomic regions provided through a .bed file or narrowPeak file can be plotted. Multiple samples (.bam files) and multiple set of regions (.bed file) can be overlayed on the same figure, or can be output as various combinations. When input files (for input samples) are available, ratio-over-input are displayed as well. Statistical comparisons between samples or between features can be plotted as boxplots or barplots of means 0b1 S.E.
centerfiles <- system.file("extdata", "test_chip_peak_chr19.narrowPeak", package="GenomicPlot")
names(centerfiles) <- c("NarrowPeak")
queryfiles <- c(system.file("extdata", "treat_chr19.bam", package="GenomicPlot"),
system.file("extdata", "chip_treat_chr19.bam", package="GenomicPlot"))
names(queryfiles) <- c("clip_bam", "chip_bam")
inputfiles <- c(system.file("extdata", "input_chr19.bam", package="GenomicPlot"),
system.file("extdata", "chip_input_chr19.bam", package="GenomicPlot"))
names(inputfiles) <- c("clip_input", "chip_input")
importParams <- list(offset=0, fix_width=150, fix_point="start", norm=TRUE,
useScore=FALSE, outRle=TRUE, useSizeFactor=FALSE, genome="hg19")
plot_region(queryFiles=queryfiles,
centerFiles=centerfiles,
inputFiles=inputfiles,
nbins=100,
heatmap=TRUE,
scale=FALSE,
regionName="narrowPeak",
importParams=importParams,
verbose=FALSE,
fiveP=-500,
threeP=500,
smooth=TRUE,
transform=NA,
stranded=TRUE,
outPrefix=NULL,
rmOutlier=0,
nc=5)Individual sample profile
Individual sample coverage stats in narrowPeak
Individual sample coverage profiles and heatmaps in narrowPeak
Ratio-over-input profile
Ratio-over-input stats
Ratio-over-input profiles and heatmaps
Program names and arguments
When the intron splicing sites are focus of studies, we want to inspect signals at both 5’ splicing sites and 3’ splicing sites, perhaps want to contrast with the center of introns as well, then displaying all 3 pieces of information in the same panel would be desirable.
txdb <- AnnotationDbi::loadDb(system.file("extdata", "txdb_chr19.sql", package="GenomicPlot"))
queryfiles <- system.file("extdata", "treat_chr19.bam", package="GenomicPlot")
names(queryfiles) <- "clip_bam"
inputfiles <- system.file("extdata", "input_chr19.bam", package="GenomicPlot")
names(inputfiles) <- "clip_input"
ext <- c(-500, 200, -200, 500)
hl <- c(-50, 50, -50, 50)
importParams <- list(offset=-1, fix_width=0, fix_point="start", norm=TRUE,
useScore=FALSE, outRle=TRUE, useSizeFactor=FALSE, genome="hg19")
plot_start_end(queryFiles=queryfiles,
inputFiles=inputfiles,
txdb=txdb,
centerFiles="intron",
binSize=10,
importParams=importParams,
ext=ext,
hl=hl,
insert=100,
stranded=TRUE,
scale=FALSE,
smooth=TRUE,
outPrefix=NULL,
nc=5)Query and input sample profiles in start, center and end of intron
Ratio-over profile in start, center and end of intron
Difference in signal intensity within specific regions around the reference loci can be tested, and the test statistics can be plotted as boxplot and barplot of mean +/- SE. The test can be done among loci or among samples.
centerfiles <- c(system.file("extdata", "test_clip_peak_chr19.bed", package="GenomicPlot"),
system.file("extdata", "test_chip_peak_chr19.bed", package="GenomicPlot"))
names(centerfiles) <- c("iCLIPPeak", "SummitPeak")
queryfiles <- c(system.file("extdata", "treat_chr19.bam", package="GenomicPlot"),
system.file("extdata", "chip_treat_chr19.bam", package="GenomicPlot"))
names(queryfiles) <- c("clip_bam", "chip_bam")
inputfiles <- c(system.file("extdata", "input_chr19.bam", package="GenomicPlot"),
system.file("extdata", "chip_input_chr19.bam", package="GenomicPlot"))
names(inputfiles) <- c("clip_input", "chip_input")
importParams <- list(offset=0, fix_width=150, fix_point="start", norm=TRUE,
useScore=FALSE, outRle=TRUE, useSizeFactor=TRUE, genome="hg19")
plot_locus(queryFiles=queryfiles,
centerFiles=centerfiles,
ext=c(-500,500),
hl=c(-100,100),
shade=TRUE,
smooth=TRUE,
importParams=importParams,
binSize=10,
refPoint="center",
Xlab="Center",
inputFiles=inputfiles,
stranded=TRUE,
scale=FALSE,
outPrefix=NULL,
verbose=FALSE,
transform=NA,
rmOutlier=0,
statsMethod="wilcox.test",
heatmap=TRUE,
nc=5)Ratio-over-input for clip signal around clip and chip peaks
Ratio-over-input for clip signal around clip and chip peaks
Ratio-over-input for chip signal around clip and chip peaks
Ratio-over-input for chip signal around clip and chip peaks
Ratio-over-input for chip and clip signal around clip peaks
Ratio-over-input for chip and clip signal around clip peaks
Ratio-over-input for chip and clip signal around chip peaks
Ratio-over-input for chip and clip signal around chip peaks
Ratio-over-input profiles for chip and clip signal around clip and chip peaks
Ratio-over-input profiles and heatmap for chip and clip signal
Random sites that deviate from the reference loci within a specified maximum distance can be treated as a control, and the statistical tests will be performed between bona fide and the random sites. Here the reference sites located in different genomic features (5’UTR, CDS, 3’UTR) are tested separately. To avoid sampling bias, increase “n_random”.
txdb <- AnnotationDbi::loadDb(system.file("extdata", "txdb_chr19.sql", package="GenomicPlot"))
centerfiles <- c(system.file("extdata", "test_clip_peak_chr19.bed", package="GenomicPlot"))
names(centerfiles) <- c("iCLIPPeak")
queryfiles <- c(system.file("extdata", "treat_chr19.bam", package="GenomicPlot"))
names(queryfiles) <- c("clip_bam")
inputfiles <- c(system.file("extdata", "input_chr19.bam", package="GenomicPlot"))
names(inputfiles) <- c("clip_input")
importParams <- list(offset=-1, fix_width=150, fix_point="start", norm=TRUE,
useScore=FALSE, outRle=TRUE, useSizeFactor=TRUE, genome="hg19")
plot_locus_with_random(queryFiles=queryfiles,
centerFiles=centerfiles,
txdb,
ext=c(-500,500),
hl=c(-100,100),
shade=TRUE,
importParams=importParams,
binSize=10,
refPoint="center",
Xlab="Center",
smooth=TRUE,
inputFiles=inputfiles,
stranded=TRUE,
scale=FALSE,
outPrefix=NULL,
verbose=FALSE,
transform=NA,
rmOutlier=0,
n_random=1,
statsMethod="wilcox.test",
nc=5)Ratio-over-input for clip signal around clip peaks in 5’UTR
Ratio-over-input for clip signal around clip peaks in 5’UTR
Ratio-over-input for clip signal around clip peaks
Ratio-over-input for clip signal around clip peaks
Aside from reads coverage profiles, distribution of binding peaks in different gene types and genomic features is also important. Peak annotation statistics are plotted as bar chart for distribution in gene types, and as pie charts for distribution in genomic features. The pie charts are plotted in two different ways: either as percentage of absolute counts or as percentage of feature length-normalized counts in each features. For DNA binding samples, the features, in order of precedence, include “Promoter”, “TTS”, “5’UTR”, “CDS”, “3’UTR” and “Intron”; for RNA binding samples, “Promoter” and “TTS” are excluded. “Promoter” is defined as regions 2000 bp upstream of transcription start site (TSS) and 300 bp downstream TSS, “TTS” is defined as the region 1000 bp downstream cleavage site or the length between cleavage site and the start of the next gene, whichever is shorter, but these lengths can be adjusted. To save annotation results (both peak-oriented and gene-oriented), set “verbose”=TRUE.
gtffile <- system.file("extdata", "gencode.v19.annotation_chr19.gtf", package="GenomicPlot")
centerfile <- system.file("extdata", "test_chip_peak_chr19.bed", package="GenomicPlot")
names(centerfile) <- c("SummitPeak")
handleBedparams <- list(fix_width=0, fix_point="center", useScore=FALSE, outRle=FALSE,
offset=0, norm=FALSE, useSizeFactor=FALSE, genome="hg19")
pa <- plot_peak_annotation(peakFile=centerfile,
gtfFile=gtffile,
importParams=handleBedparams,
fiveP=-2000,
dsTSS=300,
threeP=1000,
simple=FALSE,
verbose=TRUE,
outPrefix=NULL)Distribution of chip peak in various types of gene
Distribution of chip peak in various parts of gene
Density of chip peak in various parts of gene
sessionInfo()
#> R version 4.2.1 (2022-06-23 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19044)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=C
#> [2] LC_CTYPE=English_United States.utf8
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.utf8
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] GenomicPlot_0.1.4 GenomicFeatures_1.49.7 AnnotationDbi_1.59.1
#> [4] Biobase_2.57.1 GenomicRanges_1.49.0 GenomeInfoDb_1.33.10
#> [7] IRanges_2.31.2 S4Vectors_0.35.4 BiocGenerics_0.43.4
#>
#> loaded via a namespace (and not attached):
#> [1] utf8_1.2.2 tidyselect_1.2.0
#> [3] RSQLite_2.2.18 htmlwidgets_1.5.4
#> [5] grid_4.2.1 ranger_0.14.1
#> [7] BiocParallel_1.31.13 devtools_2.4.5
#> [9] munsell_0.5.0 codetools_0.2-18
#> [11] DT_0.25 miniUI_0.1.1.1
#> [13] withr_2.5.0 colorspace_2.0-3
#> [15] filelock_1.0.2 highr_0.9
#> [17] knitr_1.40 rstudioapi_0.14
#> [19] ggsignif_0.6.4 MatrixGenerics_1.9.1
#> [21] GenomeInfoDbData_1.2.9 seqPattern_1.29.0
#> [23] bit64_4.0.5 pheatmap_1.0.12
#> [25] rprojroot_2.0.3 vctrs_0.4.2
#> [27] generics_0.1.3 lambda.r_1.2.4
#> [29] xfun_0.33 BiocFileCache_2.5.2
#> [31] ggseqlogo_0.1 R6_2.5.1
#> [33] doParallel_1.0.17 clue_0.3-61
#> [35] locfit_1.5-9.6 bitops_1.0-7
#> [37] cachem_1.0.6 gridGraphics_0.5-1
#> [39] DelayedArray_0.23.2 assertthat_0.2.1
#> [41] promises_1.2.0.1 BiocIO_1.7.1
#> [43] scales_1.2.1 gtable_0.3.1
#> [45] processx_3.7.0 rlang_1.0.6
#> [47] GlobalOptions_0.1.2 rstatix_0.7.0
#> [49] rtracklayer_1.57.0 lazyeval_0.2.2
#> [51] impute_1.71.0 broom_1.0.1
#> [53] plyranges_1.17.0 abind_1.4-5
#> [55] yaml_2.3.5 reshape2_1.4.4
#> [57] backports_1.4.1 httpuv_1.6.6
#> [59] RCAS_1.23.0 tools_4.2.1
#> [61] usethis_2.1.6 ggplotify_0.1.0
#> [63] gridBase_0.4-7 ggplot2_3.3.6
#> [65] ellipsis_0.3.2 jquerylib_0.1.4
#> [67] RColorBrewer_1.1-3 proxy_0.4-27
#> [69] sessioninfo_1.2.2 Rcpp_1.0.9
#> [71] plyr_1.8.7 progress_1.2.2
#> [73] zlibbioc_1.43.0 purrr_0.3.5
#> [75] RCurl_1.98-1.9 ps_1.7.1
#> [77] prettyunits_1.1.1 ggpubr_0.4.0
#> [79] pbapply_1.5-0 GetoptLong_1.0.5
#> [81] viridis_0.6.2 cowplot_1.1.1
#> [83] urlchecker_1.0.1 SummarizedExperiment_1.27.3
#> [85] cluster_2.1.4 fs_1.5.2
#> [87] magrittr_2.0.3 data.table_1.14.2
#> [89] futile.options_1.0.1 circlize_0.4.15
#> [91] matrixStats_0.62.0 pkgload_1.3.0
#> [93] hms_1.1.2 mime_0.12
#> [95] evaluate_0.17 xtable_1.8-4
#> [97] XML_3.99-0.11 VennDiagram_1.7.3
#> [99] gridExtra_2.3 shape_1.4.6
#> [101] compiler_4.2.1 biomaRt_2.53.3
#> [103] tibble_3.1.8 KernSmooth_2.23-20
#> [105] crayon_1.5.2 BSgenome.Hsapiens.UCSC.hg19_1.4.3
#> [107] htmltools_0.5.3 later_1.3.0
#> [109] tzdb_0.3.0 tidyr_1.2.1
#> [111] DBI_1.1.3 formatR_1.12
#> [113] genomation_1.29.0 gprofiler2_0.2.1
#> [115] dbplyr_2.2.1 ComplexHeatmap_2.13.3
#> [117] rappdirs_0.3.3 car_3.1-0
#> [119] Matrix_1.5-1 readr_2.1.3
#> [121] cli_3.6.1 parallel_4.2.1
#> [123] forcats_0.5.2 pkgconfig_2.0.3
#> [125] GenomicAlignments_1.33.1 plotly_4.10.0
#> [127] xml2_1.3.3 roxygen2_7.2.1
#> [129] foreach_1.5.2 bslib_0.4.0
#> [131] XVector_0.37.1 yulab.utils_0.0.5
#> [133] stringr_1.4.1 callr_3.7.2
#> [135] digest_0.6.29 Biostrings_2.65.6
#> [137] rmarkdown_2.17 edgeR_3.39.6
#> [139] restfulr_0.0.15 curl_4.3.3
#> [141] shiny_1.7.2 Rsamtools_2.13.4
#> [143] rjson_0.2.21 lifecycle_1.0.3
#> [145] jsonlite_1.8.2 carData_3.0-5
#> [147] futile.logger_1.4.3 desc_1.4.2
#> [149] viridisLite_0.4.1 limma_3.53.10
#> [151] BSgenome_1.65.2 fansi_1.0.3
#> [153] pillar_1.8.1 ggsci_2.9
#> [155] lattice_0.20-45 KEGGREST_1.37.3
#> [157] fastmap_1.1.0 httr_1.4.4
#> [159] plotrix_3.8-2 pkgbuild_1.3.1
#> [161] glue_1.6.2 remotes_2.4.2
#> [163] png_0.1-7 iterators_1.0.14
#> [165] bit_4.0.4 sass_0.4.2
#> [167] stringi_1.7.8 profvis_0.3.7
#> [169] blob_1.2.3 memoise_2.0.1
#> [171] dplyr_1.0.10